Fault Tolerance in MPI Programs

Author

  • William Gropp
Abstract

This paper examines the topic of writing fault-tolerant MPI applications. We discuss the meaning of fault tolerance in general and what the MPI Standard has to say about it. We survey several approaches to this problem, namely checkpointing, restructuring a class of standard MPI programs, modifying MPI semantics, and extending the MPI specification. We conclude that within certain constraints, MPI can provide a useful context for writing application programs that exhibit significant degrees of fault tolerance.

Similar resources

Fault Tolerance in Message Passing Interface Programs

In this paper we examine the topic of writing fault-tolerant Message Passing Interface (MPI) applications. We discuss the meaning of fault tolerance in general and what the MPI Standard has to say about it. We survey several approaches to this problem, namely checkpointing, restructuring a class of standard MPI programs, modifying MPI semantics, and extending the MPI specification. We conclude ...

Failure Resilient Heterogeneous Parallel Computing Across Multidomain Clusters

We propose lightweight middleware solutions that facilitate and simplify the execution of failure-resilient MPI programs across multidomain clusters. The system described in this paper leverages H2O, a distributed metacomputing framework, to route MPI message passing across heterogeneous aggregates located in different administrative or network domains. MPI programs instantiate a specially writ...

Starfish: Fault-Tolerant Dynamic MPI Programs on Clusters of Workstations

This paper reports on the architecture and design of Starfish, an environment for executing dynamic and static MPI programs on a cluster of workstations. Starfish is unique in being efficient, fault tolerant, highly available, and dynamic as a system internally, and in supporting fault tolerance and dynamicity for its application programs as well. Starfish achieves these goals by combining group communic...

In-Memory Checkpointing for MPI Programs by XOR-Based Double-Erasure Codes

Today, high-performance computing (HPC) systems are larger than ever, which makes fault tolerance a challenge. MPI (Message Passing Interface) is one of the most important programming tools for HPC. There are quite a few fault-tolerant extensions for MPI, such as MPICH-V, StarFish, and FT-MPI. Most of them are based on on-disk checkpointing. In this pape...

Collective Operations in an Application-level Fault Tolerant MPI System

The running times of many computational science programs are now significantly greater than the mean-time-between-failures (MTBF) of the hardware they run on. Therefore, fault tolerance is becoming a critical issue on high-performance platforms. Checkpointing is a technique for making programs fault tolerant by periodically saving their state and restoring this state after failure. In system-leve...
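
The save-and-restore cycle described in this abstract can be sketched in a few lines. What follows is a generic, single-process illustration in Python, not the application-level MPI system the abstract refers to. The function names, the pickle file format, and the `fail_at` hook that simulates a crash are all assumptions made for the example.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: dump to a temp file, then rename it over the old checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)  # a crash mid-write never corrupts the old checkpoint


def load_checkpoint(path):
    """Return the last saved state, or a fresh initial state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "total": 0}


def run(path, n_steps, interval, fail_at=None):
    """Sum the integers 0..n_steps-1, checkpointing every `interval` steps.

    `fail_at` is a hypothetical hook that simulates a failure at that step.
    """
    state = load_checkpoint(path)  # resume from the last checkpoint, if any
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["total"]
```

Calling `run` again with the same path after a failure resumes from the most recent checkpoint, so only the work done since that checkpoint is repeated. A shorter interval reduces the repeated work at the cost of more frequent writes.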

Publication date: 2002